1 An Example


original (published figure) vs. reconstructed (plot generated below):
load(file = "../4DSurv/4DSurv.RData")
d <- img$survData
d$stratum <- factor(d$stratum, levels = c('low risk', 'high risk'))
f <- survfit(Surv(time, status) ~ stratum, data = d)
ggsurvplot(f, risk.table = T, palette = c("blue", "red"), 
           legend = "none", conf.int = T)

d$time <- format(round(d$time, 2), nsmall = 2)

datatable(d, rownames = F,
          options = list(
                    pageLength = nrow(d),
                    scrollY = "600px", dom = 't'
                    )
          )

Kaplan–Meier plot for a conventional parameter model using a composite of manually derived volumetric measures (left plot) (Bello et al. 2019, 97)

2 survThief


An R package for digitizing and reconstructing the data underlying images of Kaplan–Meier survival plots

2.1 Motivations for a digitization package


~ 350’000
Number of articles returned by the search “Kaplan Meier” on https://bib.kuleuven.be

5%
Proportion of 160 randomly sampled research articles in the BMJ (2009–2015) that shared their dataset. (Naudet et al. 2018)

“it is an ethical obligation to responsibly share data generated by interventional clinical trials because participants have put themselves at risk.” (Naudet et al. 2018)

“The ICMJE [International Committee of Medical Journal Editors] proposed to require that deidentified individual patient data (IPD) are made publicly available no later than six months after publication of the trial results. This proposal triggered debate. In June 2017, the ICMJE stepped back from its proposal.” (Naudet et al. 2018)

2.1.1 The status quo


“In our experience, although little training is required, it takes about half an hour to obtain the initial input for one curve. If incorrect data are entered this may trap the algorithm.” (Guyot et al. 2012)

“The data prepared for the Guyot et al. method should be sufficient, and every step in the KM curves should be captured. Hundreds of data points were required for the Guyot et al. method” (Wan, Peng, and Li 2015a)

2.1.2 Proposed improvement


digitize

digitize the curve with one click, at the maximal level of granularity with a “walking” algorithm along the curve.

reconstruct

recognize censoring mark positions with a Hough transform and reconstruct the survival data accordingly. If no censoring marks are available, the package implements Guyot et al.'s method.

IPD survival meta-analysis

survival meta-analysis on the entire reconstructed dataset instead of on summary statistics.
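As an illustration of the walking idea, here is a toy version on a logical matrix; this is only a sketch, not the package's actual `walk()`, which operates on colour images and handles gaps in the drawn curve via `jump()` and the `fillGap*` functions:

```r
# Walk along a monotone step curve in a logical matrix (TRUE = curve
# pixel). Rows are x (time, left to right), columns are y (pixel rows,
# increasing downwards). Move right while possible, otherwise step down.
walk_curve <- function(m) {
  i <- 1
  j <- which(m[1, ])[1]               # first curve pixel at x = 1
  path <- matrix(c(i, j), ncol = 2)
  while (i < nrow(m)) {
    if (m[i + 1, j]) {                # continue rightwards
      i <- i + 1
    } else if (j < ncol(m) && m[i, j + 1]) {
      j <- j + 1                      # drop: a step in the curve
    } else break                      # gap in the drawn curve
    path <- rbind(path, c(i, j))
  }
  path
}

# a 5-pixel-wide toy curve: flat, one two-pixel drop, flat again
m <- matrix(FALSE, nrow = 5, ncol = 3)
m[1:3, 1] <- TRUE
m[3, 2]   <- TRUE
m[3:5, 3] <- TRUE
p <- walk_curve(m)
```

Recording every visited pixel gives the curve at the maximal granularity the image allows, which is exactly what makes the one-click digitization possible.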

2.2 Preprocessing - colour quantization


the problem (figures: the original image and a zoomed detail)

2.2.1 K-means


Assumptions
convexity, no outliers, known number of clusters

\[\underset{\mathbf{S}}{\arg \min } \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}\]
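As a sketch, the objective above can be applied directly to image pixels with `stats::kmeans`; the simulated pixel matrix and k = 4 below are illustrative stand-ins for the real image data:

```r
# Quantize simulated RGB pixels to k = 4 colours: cluster in RGB space,
# then replace every pixel by its cluster centroid.
set.seed(1)
px <- rbind(
  matrix(runif(300, 0.00, 0.05), ncol = 3),   # near-black pixels (axes, text)
  matrix(runif(300, 0.95, 1.00), ncol = 3),   # near-white background
  cbind(runif(100, 0.9, 1.0), runif(100, 0.0, 0.1), runif(100, 0.0, 0.1)),  # red curve
  cbind(runif(100, 0.0, 0.1), runif(100, 0.0, 0.1), runif(100, 0.9, 1.0))   # blue curve
)
colnames(px) <- c("red", "green", "blue")
km <- kmeans(px, centers = 4, nstart = 10)
quantized <- km$centers[km$cluster, ]   # every pixel replaced by its centroid
```

Reshaping `quantized` back to the image dimensions yields a four-colour version of the plot, with anti-aliased edge pixels snapped to the nearest cluster centre.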

load("dfSegmented.rds")

library(plotly)

dfSmall <- df[sample(1:nrow(df), 10000),]

# original
library(shiny)
div(plot_ly(dfSmall, size = 0.1, x = ~red, y = ~green, z = ~blue, 
        marker = list(color = ~rgb(dfSmall$red, dfSmall$green, dfSmall$blue),
                      line = list(width = 0))) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'R'),
                   yaxis = list(title = 'G'),
                   zaxis = list(title = 'B'))) %>%
  layout(paper_bgcolor='rgb(254, 247, 234)'), align = "center")
pixels in RGB space (plotting 10’000 sampled points)
# segmented
div(plot_ly(dfSmall, size = 0.1, x = ~red, y = ~green, z = ~blue, 
        marker = list(color = ~rgb(dfSmall$R, dfSmall$G, dfSmall$B),
                      line = list(width = 0))) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'R'),
                   yaxis = list(title = 'G'),
                   zaxis = list(title = 'B'))) %>%
  layout(paper_bgcolor='rgb(254, 247, 234)'), align = "center")
k-means (k=4) clustered pixels in RGB space (plotting 10’000 sampled points)

2.2.2 Spectral Clustering


Assumptions
no outliers, known number of clusters

Define \(S\), an \(N \times N\) similarity matrix over all possible pairs of points \(i\) and \(j\) (including \(i = j\)).

Each \(s_{ij}\) contains a standardized Euclidean distance between points \(i\) and \(j\).

Standardization can be done with different kernels (here the kernel \(s_{i j}=\exp \left(-d_{i j}^{2} / c\right)\) with scale parameter \(c\) was used).

Apply KNN to \(S\) to find each point's nearest neighbours. Replace all \(s_{ij}\)'s corresponding to non-neighbouring pairs by zero and call the resulting matrix \(W\) (sparser than \(S\)).

Define a vector \(g\) such that \(g_{i}=\sum_{j} w_{i j}\): the sum of the distance weights for each point \(i\). Let \(G\) be a diagonal matrix with diagonal elements \(g_i\)’s. Define the graph Laplacian \(L = G - W\) (normalized version \(\tilde{\mathbf{L}}=\mathbf{I}-\mathbf{G}^{-1} \mathbf{W}\)).

Finally, perform K-means on the rows of the matrix of eigenvectors of \(L\) corresponding to its smallest eigenvalues (Hastie, Tibshirani, and Friedman 2009, 544).
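The steps above can be sketched in base R on toy two-dimensional data; the scale \(c = 2\), the 10 nearest neighbours, and the two simulated clusters are arbitrary choices for this illustration:

```r
# Spectral clustering sketch: kernelized similarities, kNN sparsification,
# graph Laplacian, then K-means on the smallest eigenvectors.
set.seed(2)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),   # cluster around (0, 0)
           matrix(rnorm(60, mean = 6), ncol = 2))   # cluster around (6, 6)
d2 <- as.matrix(dist(x))^2
S  <- exp(-d2 / 2)                   # kernel with scale parameter c = 2
W  <- matrix(0, nrow(S), ncol(S))
for (i in 1:nrow(S)) {
  nn <- order(d2[i, ])[1:10]         # 10 nearest neighbours (incl. the point)
  W[i, nn] <- S[i, nn]
}
W <- pmax(W, t(W))                   # symmetrize the sparsified matrix
G <- diag(rowSums(W))
L <- G - W                           # graph Laplacian
ev <- eigen(L, symmetric = TRUE)$vectors
z  <- ev[, (ncol(ev) - 1):ncol(ev)]  # eigenvectors of the 2 smallest eigenvalues
cl <- kmeans(z, centers = 2, nstart = 10)$cluster
```

With well-separated clusters the kNN graph splits into components, the smallest eigenvectors are (near-)indicator vectors of those components, and the final K-means step becomes trivial.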

load("sc.rds")

div(plot_ly(data.frame(s), size = 0.1, x = ~X1, y = ~X2, z = ~X3, 
        marker = list(color = ~rgb(sc@centers[sc@.Data, 1], 
                                   sc@centers[sc@.Data, 2], 
                                   sc@centers[sc@.Data, 3]),
                      line = list(width = 0))) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'R'),
                   yaxis = list(title = 'G'),
                   zaxis = list(title = 'B'))) %>%
  layout(paper_bgcolor='rgb(254, 247, 234)'), align = "center")
Spectral Clustering (4 clusters) on a sample of 2’000 points

C++ implementation

https://github.com/yuj-umd/fastsc https://help.ubuntu.com/community/Arpack%2B%2B

2.2.3 DBSCAN


Assumptions
none of the above

Density-based method

First define two parameters:

  • \(\rho\), the maximum radius of a neighbourhood
  • \(\phi\), the minimum number of points within a neighbourhood

Define core points, which satisfy both criteria. Define border points as points lying within a \(\rho\)-neighbourhood of a core point but not themselves satisfying \(\phi\). Define a core object as a neighbourhood where both criteria are satisfied.

Define direct reachability between two points \(i\), \(j\): \(i\) is directly density reachable from \(j\), if \(i\) is within \(\rho\) of \(j\) and \(j\) is a core object.

Define generic reachability between two points \(i\), \(j\): there exists a chain of points \(k_1, k_2, ..., k_n\), where \(k_1 = i\) and \(k_n = j\), such that each point is directly reachable from the next one in the chain.

Group accordingly and discard points that are unreachable from any other as noise (Yu-Wei 2015).
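A minimal base-R sketch of the procedure above (parameter names follow the text: `rho` is the neighbourhood radius, `phi` the minimum neighbourhood size); this toy implementation is for illustration only:

```r
# Mini-DBSCAN: grow clusters from unvisited core points via a breadth-
# first search through rho-neighbourhoods; points never reached stay 0.
dbscan_mini <- function(x, rho, phi) {
  n  <- nrow(x)
  d  <- as.matrix(dist(x))
  nb <- lapply(1:n, function(i) which(d[i, ] <= rho))  # incl. the point itself
  core <- vapply(nb, length, integer(1)) >= phi
  cl <- integer(n)                      # 0 = noise / not yet visited
  k  <- 0L
  for (i in which(core)) {
    if (cl[i] != 0L) next               # already absorbed by a cluster
    k <- k + 1L
    frontier <- i
    while (length(frontier) > 0) {
      j <- frontier[1]
      frontier <- frontier[-1]
      if (cl[j] == 0L) {
        cl[j] <- k
        if (core[j])                    # only core points expand the cluster
          frontier <- c(frontier, nb[[j]][cl[nb[[j]]] == 0L])
      }
    }
  }
  cl
}

set.seed(3)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.15), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.15), ncol = 2),
           c(10, 10))                   # an isolated outlier
cl <- dbscan_mini(x, rho = 0.6, phi = 4)
```

Unlike K-means and spectral clustering, the number of clusters falls out of the density parameters, and the outlier is reported as noise rather than forced into a cluster.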

load("dbscan.rds")

div(plot_ly(data.frame(s), size = 0.1, x = ~red, y = ~green, z = ~blue, 
        marker = list(color = ~rgb(s$centerRed, s$centerGreen, s$centerBlue),
                      line = list(width = 0))) %>%
        # color = K$cluster) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'R'),
                      yaxis = list(title = 'G'),
                      zaxis = list(title = 'B'))) %>%
  layout(paper_bgcolor='rgb(254, 247, 234)'), align = "center")
DBScan (neighborhood size: 0.35) on a sample of 20’000 points

2.2.4 Colour buckets [naive approach]


https://en.wikipedia.org/wiki/HSL_and_HSV

Wikipedia Definition:

\(\begin{equation} H :=\left\{\begin{array}{ll}{0,} & {\text { if } M A X=M I N \Leftrightarrow R=G=B} \\ {60^{\circ} \cdot\left(0+\frac{G-B}{M A X-M I N}\right),} & {\text { if } M A X=R} \\ {60^{\circ} \cdot\left(2+\frac{B-R}{M A X-M I N}\right),} & {\text { if } M A X=G} \\ {60^{\circ} \cdot\left(4+\frac{R-G}{M A X-M I N}\right),} & {\text { if } M A X=B}\end{array}\right. \end{equation}\)

\(\text { if } H<0^{\circ} \text { then } H :=H+360^{\circ}\)

\(\begin{equation} S :=\left\{\begin{array}{ll}{0,} & {\text { if } M A X=0 \Leftrightarrow R=G=B=0} \\ {0,} & {\text { if } M I N=1 \Leftrightarrow R=G=B=1} \\ {\frac{M A X-M I N}{1-|M A X+M I N-1|}=\frac{2 M A X-2 L}{1-|2 L-1|}=\frac{M A X-L}{\min (L, 1-L)},} & {\text { otherwise }}\end{array}\right. \end{equation}\)

\(\begin{equation} L :=\frac{M A X+M I N}{2} \end{equation}\)
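A direct transcription of the three definitions above into R, for a single pixel with R, G, B in [0, 1]:

```r
# One-pixel RGB -> HSL, following the definitions above
rgb2hsl_one <- function(R, G, B) {
  MAX <- max(R, G, B)
  MIN <- min(R, G, B)
  L <- (MAX + MIN) / 2
  if (MAX == MIN) {                 # grey: hue and saturation are zero
    H <- 0
    S <- 0
  } else {
    H <- if (MAX == R)      60 * (0 + (G - B) / (MAX - MIN))
         else if (MAX == G) 60 * (2 + (B - R) / (MAX - MIN))
         else               60 * (4 + (R - G) / (MAX - MIN))
    if (H < 0) H <- H + 360         # wrap negative hues into [0, 360)
    S <- (MAX - MIN) / (1 - abs(MAX + MIN - 1))
  }
  c(H = H, S = S, L = L)
}
```

Bucketing pixels then reduces to binning H (e.g. with `findInterval`, as in the package's `quantize`) after separating near-white (low S, high L) and near-black (low L) pixels.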

library(plotwidgets)
fundCols <- hsl2rgb(t(cbind(seq(360*1/12, 360*11/12, 360*2/12), 0.5, 0.5)))
background <- hsl2rgb(t(cbind(seq(1, 360), 0.5, 0.5)))

library(ggplot2)
ggplot() + geom_rect(aes(xmin=0:359, xmax=1:360, ymin=-1, ymax=0), 
                     fill = rgb(t(background), maxColorValue = 255)) +
  geom_rect(aes(xmin=seq(0, 300, 60), xmax=seq(60, 360, 60), ymin=0, ymax=1), 
                     fill = rgb(t(fundCols), maxColorValue = 255)) +
  theme_void()
hue range and colour buckets (fixing saturation and lightness at 1/2)


2.3 Supported Input


2.3.1 Images

magick::magick_config() %>% unlist %>% subset(. == T) %>% names
##  [1] "cairo"              "fontconfig"         "freetype"          
##  [4] "jpeg"               "lcms"               "libopenjp2"        
##  [7] "lzma"               "pangocairo"         "pango"             
## [10] "png"                "rsvg"               "tiff"              
## [13] "webp"               "xml"                "zero-configuration"

2.3.2 PDF

2.4 Types of images


  • with at risk table

    • with censoring marks -> survThief

    • without censoring marks -> Guyot

  • without at risk table

    • with censoring marks -> ?

    • without censoring marks -> Guyot
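As a sketch, the branching above can be captured in a small helper; `pick_method` is a hypothetical name for illustration, not part of the package API:

```r
# Map the two image features to a reconstruction method, following the
# decision tree above. The marks-without-table case is an open question.
pick_method <- function(at_risk_table, censoring_marks) {
  if (censoring_marks && at_risk_table) "survThief"   # full reconstruction
  else if (!censoring_marks)            "Guyot"       # fall back to Guyot et al.
  else NA_character_
}
```

The table of subjects at risk is what anchors absolute counts; without it, censoring marks alone do not determine the number of patients behind each step.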

2.5 Reconstruction


library(mvbutils)
library(survThief)

foodweb(where="package:survThief",
        boxcolor = "#FC6512", textcolor = "black",
        cex = 0.7, lwd=2)

2.5.1 survThief workflow

2.5.1.1 UI

digi, digitData

2.5.1.2 Walking Algorithm

walk, jump, fillGap

2.5.1.3 Detect Censoring

detectCensoring

2.5.1.4 Combine atRisk table, censoring marks and walk

adjCensored, censoredAtRisk

2.5.1.5 Method when no censoring mark is available

guyot

2.5.2 survThief script by function

adjCensored

eval(parse(text=paste0('survThief:::', 'adjCensored')))
function (d, c) 
{
    index <- getIndex(d, c, "after adjustment")
    for (i in 2:nrow(c)) {
        if (i > 2) 
            prev <- as.character(as.numeric(index[i - 1]) + 1)
        else prev <- index[i - 1]
        prev <- as.character(min(as.numeric(prev), as.numeric(index[length(index)])))
        interval <- d[which(rownames(d) == prev):which(rownames(d) == 
            index[i]), ]
        temp <- interval[interval$status != 1, ]
        temp <- temp[order(temp$status, decreasing = T), ]
        if (nrow(temp) == 1) 
            tooSmall <- c$censored[i] * nrow(temp)
        else tooSmall <- rownames(temp[(c$censored[i] + 1):nrow(temp), 
            ])
        d <- d[!(rownames(d) %in% tooSmall), ]
    }
    d$status[d$status != 1] <- 0
    return(d)
}
<bytecode: 0x7fc9b49dab30>
<environment: namespace:survThief>

censoredAtRisk

eval(parse(text=paste0('survThief:::', 'censoredAtRisk')))
function (d, c, RGB) 
{
    index <- getIndex(d, c, "before adjustment")
    c$censored <- 0
    c$events <- 0
    for (i in 2:nrow(c)) {
        if (is.na(d[index[i], "survProb"])) {
            indexRow <- which(rownames(d) == rownames(d[index[i], 
                ]))
            index[i] <- max(which(!is.na(d[1:indexRow, "survProb"])))
        }
        c$events[i] <- round(c[[RGB]][i - 1] * (1 - d[index[i], 
            "survProb"]/d[index[i - 1], "survProb"]))
        c$censored[i] <- round((c[[RGB]][i - 1] - c[[RGB]][i]) - 
            c$events[i])
    }
    return(c)
}
<bytecode: 0x7fc9b49cd3f0>
<environment: namespace:survThief>

detectCensoring

eval(parse(text=paste0('survThief:::', 'detectCensoring')))
function (spectr, d) 
{
    library(EBImage)
    Shoriz <- matrix(c(1, 2, 1, 0, 0, 0, -1, -2, -1), nrow = 3)
    Svert <- t(Shoriz)
    imgV <- filter2(spectr, Svert)
    w <- dim(imgV)[1]
    h <- dim(imgV)[2]
    mid <- ceiling(w/2)
    d$status <- 1
    censie <- matrix(nrow = w, ncol = ncol(d))
    colnames(censie) <- colnames(d)
    censie[, "time"] <- 1:w
    for (i in 1:w) {
        censie[i, "status"] <- abs(sum((imgV[i, ] < (-0.01)) * 
            imgV[i, ]))
    }
    censie <- censie[!is.na(censie[, "time"]), ]
    rownames(censie) <- paste("sim", seq(1, nrow(censie), 1))
    d <- rbind(d, censie)
    d <- d[order(d$time), ]
    d <- d[d$status != 0, ]
    return(d)
}
<bytecode: 0x7fc9b445dbd0>
<environment: namespace:survThief>

digi

eval(parse(text=paste0('survThief:::', 'digi')))
function (img, ..., x1, x2, y1, y2) 
{
    pt_names <- c("x1", "x2", "y1", "y2")
    instructCal(pt_names)
    flush.console()
    cal <- readAndCal(img)
    missing_pt <- missing(x1) | missing(x2) | missing(y1) | missing(y2)
    if (missing_pt) {
        point_vals <- getVals(pt_names)
        for (p in names(point_vals)) assign(p, point_vals[[p]])
    }
    cat("\n\n")
    if (BW) {
        cat("..............NOW .............", "First click on the starting point of the survival curves (generally all survival curves start at the same point). Second, in the order they appear in the table of subjects at risk (atRisk), click on a region of each survival curve that clearly indicates its colour. (Do not hit ESC, close the window or press any mouse key.)", 
            "Once you are done, to exit:", " - Windows: right click on the plot area and choose 'Stop'!", 
            " - X11: hit any mouse button other than the left one.", 
            " - quartz/OS X: hit ESC", sep = "\n\n")
    }
    else {
        cat("..............NOW .............", "click on the starting point of the survival curves (generally all survival curves start at the same point). (Do not hit ESC, close the window or press any mouse key.)", 
            "Once you are done, to exit:", " - Windows: right click on the plot area and choose 'Stop'!", 
            " - X11: hit any mouse button other than the left one.", 
            " - quartz/OS X: hit ESC", sep = "\n\n")
    }
    cat("\n\n\n")
    flush.console()
    data <- digitData(...)
    cat("extracting the data from the selected survival curve(s)...")
    cat("\n\n")
    out <- list(data = data, cal = cal, x1 = x1, x2 = x2, y1 = y1, 
        y2 = y2)
    return(out)
}
<bytecode: 0x7fc9b4448ee8>
<environment: namespace:survThief>

digitData

eval(parse(text=paste0('survThief:::', 'digitData')))
function (col = "red", type = "p", ...) 
{
    type <- ifelse(type == "b", "o", type)
    type <- ifelse(type %in% c("l", "o", "p"), type, "p")
    locator(type = type, col = col, ...)
}
<bytecode: 0x7fc9b4447c40>
<environment: namespace:survThief>

fillGapStairs

eval(parse(text=paste0('survThief:::', 'fillGapStairs')))
function (line, hs, vs, he, ve, RGB) 
{
    i <- hs
    j <- vs
    while (i < he) {
        i <- i + 1
        line[i, j, 1, ] <- RGB
        if (j < ve) {
            j <- j + 1
            line[i, j, 1, ] <- RGB
        }
    }
    while (j < ve) {
        j <- j + 1
        line[i, j, 1, ] <- RGB
    }
    return(line)
}
<bytecode: 0x7fc9b44438f8>
<environment: namespace:survThief>

fillGapStraight

eval(parse(text=paste0('survThief:::', 'fillGapStraight')))
function (line, hs, vs, he, ve, RGB) 
{
    y <- t(line[hs:he, vs, 1, ])
    y[] <- RGB
    line[hs:he, vs, 1, ] <- t(y)
    z <- line[he, vs:ve, 1, ]
    z[] <- RGB
    line[he, vs:ve, 1, ] <- t(z)
    return(line)
}
<bytecode: 0x7fc9b43f1118>
<environment: namespace:survThief>

getDeaths

eval(parse(text=paste0('survThief:::', 'getDeaths')))
function (d, c, RGB) 
{
    d$deaths <- 0
    n <- c[[RGB]][1]
    prev <- 0
    for (i in 1:nrow(d)) {
        if (d$status[i] == 1) {
            if (prev > 0) {
                censies <- sum(d$status[prev:i] == 0)
                n <- n - censies
                dead <- n * (1 - d$survProb[i]/d$survProb[prev])
                d$deaths[i] <- round(dead)
                if (d$deaths[i] < 0) 
                  d$deaths[i] <- 0
                n <- n - dead
            }
            prev <- i
        }
    }
    d <- d[rep(row.names(d), d$deaths + (d$status == 0)), ]
    d <- d[colnames(d) %in% c("time", "status")]
    rownames(d) <- 1:nrow(d)
    return(d)
}
<bytecode: 0x7fc9b43df268>
<environment: namespace:survThief>

getIndex

eval(parse(text=paste0('survThief:::', 'getIndex')))
function (d, c, use = "before adjustment") 
{
    options(warn = -1)
    index <- c()
    index[1] <- 1
    temp <- d
    for (i in 2:nrow(c)) {
        index[i] <- rownames(temp[min(which((floor(temp$time) >= 
            c$time[i]) == T)), ])
        if (index[i] == "NA") 
            index[i] <- rownames(temp[nrow(temp), ])
    }
    options(warn = 0)
    return(index)
}
<bytecode: 0x7fc9b43d0a18>
<environment: namespace:survThief>

getVals

eval(parse(text=paste0('survThief:::', 'getVals')))
function (names) 
{
    vals <- list()
    for (p in names) {
        bad <- TRUE
        while (bad) {
            input <- readline(paste("What is the return of", 
                p, "?\n"))
            bad <- length(input) > 1
            if (bad) {
                cat("Error in input! Try again\n")
            }
            else {
                bad <- FALSE
            }
        }
        vals[[p]] <- as.numeric(input)
    }
    return(vals)
}
<bytecode: 0x7fc9b43cd498>
<environment: namespace:survThief>

guyot

eval(parse(text=paste0('survThief:::', 'guyot')))
function (dat, censie, RGB) 
{
    rownames(dat) <- 1:nrow(dat)
    index <- getIndex(dat, censie)
    risk <- censie[colnames(censie) %in% c("time", RGB)]
    risk$lower <- as.numeric(index)
    risk$upper <- 0
    for (i in 1:(nrow(risk) - 1)) {
        risk$upper[i] <- risk$lower[i + 1] - 1
    }
    risk <- risk[, c(1, 3, 4, 2)]
    risk$upper[nrow(risk)] <- risk$lower[nrow(risk)]
    dat[2] <- dat[2]/100
    dat$id <- as.numeric(rownames(dat))
    dat <- dat[, c(3, 1, 2)]
    risk$id <- as.numeric(rownames(risk))
    risk <- risk[, c(5, 1, 2, 3, 4)]
    library("MASS")
    library("splines")
    library("survival")
    tot.events <- "NA"
    arm.id <- 1
    digizeit <- dat
    t.S <- digizeit[, 2]
    S <- digizeit[, 3]
    pub.risk <- risk
    t.risk <- pub.risk[, 2]
    lower <- pub.risk[, 3]
    upper <- pub.risk[, 4]
    n.risk <- pub.risk[, 5]
    n.int <- length(n.risk)
    n.t <- upper[n.int]
    arm <- rep(arm.id, n.risk[1])
    n.censor <- rep(0, (n.int - 1))
    n.hat <- rep(n.risk[1] + 1, n.t)
    cen <- rep(0, n.t)
    d <- rep(0, n.t)
    KM.hat <- rep(1, n.t)
    last.i <- rep(1, n.int)
    sumdL <- 0
    if (n.int > 1) {
        for (i in 1:(n.int - 1)) {
            n.censor[i] <- round(n.risk[i] * S[lower[i + 1]]/S[lower[i]] - 
                n.risk[i + 1])
            while ((n.hat[lower[i + 1]] > n.risk[i + 1]) || ((n.hat[lower[i + 
                1]] < n.risk[i + 1]) && (n.censor[i] > 0))) {
                if (n.censor[i] <= 0) {
                  cen[lower[i]:upper[i]] <- 0
                  n.censor[i] <- 0
                }
                if (n.censor[i] > 0) {
                  cen.t <- rep(0, n.censor[i])
                  for (j in 1:n.censor[i]) {
                    cen.t[j] <- t.S[lower[i]] + j * (t.S[lower[(i + 
                      1)]] - t.S[lower[i]])/(n.censor[i] + 1)
                  }
                  cen[lower[i]:upper[i]] <- hist(cen.t, breaks = t.S[lower[i]:lower[(i + 
                    1)]], plot = F)$counts
                }
                n.hat[lower[i]] <- n.risk[i]
                last <- last.i[i]
                for (k in lower[i]:upper[i]) {
                  if (i == 1 & k == lower[i]) {
                    d[k] <- 0
                    KM.hat[k] <- 1
                  }
                  else {
                    d[k] <- round(n.hat[k] * (1 - (S[k]/KM.hat[last])))
                    KM.hat[k] <- KM.hat[last] * (1 - (d[k]/n.hat[k]))
                  }
                  n.hat[k + 1] <- n.hat[k] - d[k] - cen[k]
                  if (d[k] != 0) 
                    last <- k
                }
                n.censor[i] <- n.censor[i] + (n.hat[lower[i + 
                  1]] - n.risk[i + 1])
            }
            if (n.hat[lower[i + 1]] < n.risk[i + 1]) 
                n.risk[i + 1] <- n.hat[lower[i + 1]]
            last.i[(i + 1)] <- last
        }
    }
    if (n.int > 1) {
        n.censor[n.int] <- min(round(sum(n.censor[1:(n.int - 
            1)]) * (t.S[upper[n.int]] - t.S[lower[n.int]])/(t.S[upper[(n.int - 
            1)]] - t.S[lower[1]])), n.risk[n.int])
    }
    if (n.int == 1) {
        n.censor[n.int] <- 0
    }
    if (n.censor[n.int] <= 0 | is.na(n.censor[n.int])) {
        cen[lower[n.int]:(upper[n.int] - 1)] <- 0
        n.censor[n.int] <- 0
    }
    if (n.censor[n.int] > 0) {
        cen.t <- rep(0, n.censor[n.int])
        for (j in 1:n.censor[n.int]) {
            cen.t[j] <- t.S[lower[n.int]] + j * (t.S[upper[n.int]] - 
                t.S[lower[n.int]])/(n.censor[n.int] + 1)
        }
        cen[lower[n.int]:(upper[n.int] - 1)] <- hist(cen.t, breaks = t.S[lower[n.int]:upper[n.int]], 
            plot = F)$counts
    }
    n.hat[lower[n.int]] <- n.risk[n.int]
    last <- last.i[n.int]
    for (k in lower[n.int]:upper[n.int]) {
        if (KM.hat[last] != 0) {
            d[k] <- round(n.hat[k] * (1 - (S[k]/KM.hat[last])))
        }
        else {
            d[k] <- 0
        }
        KM.hat[k] <- KM.hat[last] * (1 - (d[k]/n.hat[k]))
        n.hat[k + 1] <- n.hat[k] - d[k] - cen[k]
        if (n.hat[k + 1] < 0 | is.na(n.hat[k + 1])) {
            n.hat[k + 1] <- 0
            cen[k] <- n.hat[k] - d[k]
        }
        if (d[k] != 0 | is.na(d[k])) 
            last <- k
    }
    if (tot.events != "NA") {
        if (n.int > 1) {
            sumdL <- sum(d[1:upper[(n.int - 1)]])
            if (sumdL >= tot.events) {
                d[lower[n.int]:upper[n.int]] <- rep(0, (upper[n.int] - 
                  lower[n.int] + 1))
                cen[lower[n.int]:(upper[n.int] - 1)] <- rep(0, 
                  (upper[n.int] - lower[n.int]))
                n.hat[(lower[n.int] + 1):(upper[n.int] + 1)] <- rep(n.risk[n.int], 
                  (upper[n.int] + 1 - lower[n.int]))
            }
        }
        if ((sumdL < tot.events) || (n.int == 1)) {
            sumd <- sum(d[1:upper[n.int]])
            while ((sumd > tot.events) || ((sumd < tot.events) && 
                (n.censor[n.int] > 0))) {
                n.censor[n.int] <- n.censor[n.int] + (sumd - 
                  tot.events)
                if (n.censor[n.int] <= 0) {
                  cen[lower[n.int]:(upper[n.int] - 1)] <- 0
                  n.censor[n.int] <- 0
                }
                if (n.censor[n.int] > 0) {
                  cen.t <- rep(0, n.censor[n.int])
                  for (j in 1:n.censor[n.int]) {
                    cen.t[j] <- t.S[lower[n.int]] + j * (t.S[upper[n.int]] - 
                      t.S[lower[n.int]])/(n.censor[n.int] + 1)
                  }
                  cen[lower[n.int]:(upper[n.int] - 1)] <- hist(cen.t, 
                    breaks = t.S[lower[n.int]:upper[n.int]], 
                    plot = F)$counts
                }
                n.hat[lower[n.int]] <- n.risk[n.int]
                last <- last.i[n.int]
                for (k in lower[n.int]:upper[n.int]) {
                  d[k] <- round(n.hat[k] * (1 - (S[k]/KM.hat[last])))
                  KM.hat[k] <- KM.hat[last] * (1 - (d[k]/n.hat[k]))
                  if (k != upper[n.int]) {
                    n.hat[k + 1] <- n.hat[k] - d[k] - cen[k]
                    if (n.hat[k + 1] < 0) {
                      n.hat[k + 1] <- 0
                      cen[k] <- n.hat[k] - d[k]
                    }
                  }
                  if (d[k] != 0) 
                    last <- k
                }
                sumd <- sum(d[1:upper[n.int]])
            }
        }
    }
    t.IPD <- rep(t.S[n.t], n.risk[1])
    event.IPD <- rep(0, n.risk[1])
    k = 1
    for (j in 1:n.t) {
        if (d[j] != 0) {
            t.IPD[k:(k + d[j] - 1)] <- rep(t.S[j], d[j])
            event.IPD[k:(k + d[j] - 1)] <- rep(1, d[j])
            k <- k + d[j]
        }
    }
    for (j in 1:(n.t - 1)) {
        if (cen[j] != 0) {
            t.IPD[k:(k + cen[j] - 1)] <- rep(((t.S[j] + t.S[j + 
                1])/2), cen[j])
            event.IPD[k:(k + cen[j] - 1)] <- rep(0, cen[j])
            k <- k + cen[j]
        }
    }
    IPD <- matrix(c(t.IPD, event.IPD, arm), ncol = 3, byrow = F)
    IPD <- as.data.frame(IPD)
    colnames(IPD) <- c("time", "status", "arm")
    return(IPD)
}
<bytecode: 0x7fc9b43c9310>
<environment: namespace:survThief>

instructCal

eval(parse(text=paste0('survThief:::', 'instructCal')))
function (pt_names) 
{
    inst0 <- "Use your mouse, and the image, but..."
    inst1 <- "...careful how you calibrate."
    inst2 <- paste("Click IN ORDER:", paste(pt_names, collapse = ", "))
    add <- list()
    add[[1]] <- "\n  |\n  |\n  |\n  |\n  |________x1__________________\n  "
    add[[2]] <- "\n  |\n  |\n  |\n  |\n  |_____________________x2_____\n  \n"
    add[[3]] <- "\n  |\n  |\n  |\n  y1\n  |____________________________\n  \n"
    add[[4]] <- "\n  |\n  y2\n  |\n  |\n  |____________________________\n  \n"
    cat(paste(inst1, inst2, sep = "\n"))
    cat("\n\n")
    for (i in 1:4) {
        cat("    Step", i, "----> Click on", pt_names[i])
        cat(add[[i]], "\n")
    }
}
<bytecode: 0x7fc9b4312900>
<environment: namespace:survThief>

jump

eval(parse(text=paste0('survThief:::', 'jump')))
function (i, j, RGB, p) 
{
    for (J in j:height) {
        for (I in i:width) {
            if (similarColour(p[I, J, 1, ], RGB, tol) & I != 
                i & (I - i) < (width/10) & (J - j) < (height/10)) {
                return(c(I, J))
            }
        }
    }
}
<bytecode: 0x7fc9b430cd60>
<environment: namespace:survThief>

quantize

eval(parse(text=paste0('survThief:::', 'quantize')))
function (path) 
{
    b <- load.image(path)
    m <- matrix(c(c(b[, , 1, 1]), c(b[, , 1, 2]), c(b[, , 1, 
        3])), nrow = prod(dim(b)[1:2]))
    library(plotwidgets)
    hsv <- rgb2hsv(255 * t(m))
    hueCat <- findInterval(hsv[1, ], c(0, 1/6, 2/6, 3/6, 4/6, 
        5/6))
    white <- (hsv[2, ] < 0.1) * (hsv[3, ] > 0.5)
    black <- hsv[3, ] < 0.5
    fundCols <- hsl2rgb(matrix(c(360 * 1/12, 0.5, 0.5, 360 * 
        3/12, 0.5, 0.5, 360 * 5/12, 0.5, 0.5, 360 * 7/12, 0.5, 
        0.5, 360 * 9/12, 0.5, 0.5, 360 * 11/12, 0.5, 0.5), nrow = 3))
    quantized <- array(1, dim = c(nrow(m), 3))
    quantized <- black %*% t(c(0, 0, 0)) + white %*% t(c(1, 1, 
        1)) + (!white * 1) * (!black * 1) * t(fundCols[, hueCat])/255
    quantizedImage <- array(quantized, dim = dim(b))
    return(quantizedImage)
}
<bytecode: 0x7fc9b4262158>
<environment: namespace:survThief>

readAndCal

eval(parse(text=paste0('survThief:::', 'readAndCal')))
function (img) 
{
    readImg(img)
    calpoints <- locator(n = 4, type = "p", pch = 4, col = "blue", 
        lwd = 2)
    return(calpoints)
}
<bytecode: 0x7fc9b4251d58>
<environment: namespace:survThief>

readImg

eval(parse(text=paste0('survThief:::', 'readImg')))
function (img) 
{
    op <- par(mar = c(0, 0, 0, 0))
    on.exit(par(op))
    plot.new()
    img <- aperm(img[, , 1, ], c(2, 1, 3))
    rasterImage(img, 0, 0, 1, 1)
}
<bytecode: 0x7fc9b398ceb0>
<environment: namespace:survThief>

rescale

eval(parse(text=paste0('survThief:::', 'rescale')))
function (d, x1, x2, y1, y2) 
{
    d$survProb <- d$survProb + (y2[2] - y1[2]) * 100/(y2[1] - 
        y1[1])
    d$time <- d$time * (x2[1] - x1[1])/(x2[2] - x1[2])
    d$survProb <- abs(d$survProb * (y2[1] - y1[1])/(y2[2] - y1[2]))
    rownames(d) <- 1:nrow(d)
    return(d)
}
<bytecode: 0x7fc9b3988778>
<environment: namespace:survThief>

similarColour

eval(parse(text=paste0('survThief:::', 'similarColour')))
function (point, RGB, tol = 0) 
{
    if (tol == 0) {
        if (is.null(dim(point))) 
            return(prod(point == RGB))
        else return(apply(t(t(point) == RGB), 1, prod))
    }
    else {
        if (max(RGB) - min(RGB) <= 0.05) {
            if (is.null(dim(point))) 
                return(max(point) - min(point) <= 0.05)
            else return(1 * (apply(point, 1, max) - apply(point, 
                1, min) <= 0.05))
        }
        else {
            upper <- pmin(RGB + tol, c(1, 1, 1))
            lower <- pmax(RGB - tol, c(0, 0, 0))
            if (is.null(dim(point))) 
                return(prod((point <= upper) * (point >= lower)))
            else return(apply((t(point) <= upper) * (t(point) >= 
                lower), 2, prod))
        }
    }
}
<bytecode: 0x7fc9b4234200>
<environment: namespace:survThief>

spectrumBW

eval(parse(text=paste0('survThief:::', 'spectrumBW')))
function (line, img, RGB) 
{
    i <- 0
    j <- 0
    si <- 1
    stop = F
    for (I in 1:width) {
        if (sum(line[I, , 1, ])/height < 3) {
            i <- I
            for (J in 1:height) {
                if (sum(line[I, J, 1, ]) < 3) {
                  j <- J
                  stop = T
                  break
                }
            }
        }
        if (stop) 
            break
    }
    h <- round(height/40)
    spectr <- array(dim = c(width, 2 * h + 1))
    spectr[si, ] <- similarColour(img[i, (j - h):(j + h), 1, 
        ], RGB, tol)
    si <- si + 1
    while (i < width & j < height & si < width) {
        i <- i + 1
        if (sum(line[i, j, 1, ]) < 3) {
            spectr[si, ] <- similarColour(img[i, (j - h):(j + 
                h), 1, ], RGB, tol)
            si <- si + 1
        }
        else {
            i = i - 1
            j = j + 1
        }
    }
    spectr <- spectr[-(si:width), ]
    return(spectr)
}
<bytecode: 0x7fc9b41ee040>
<environment: namespace:survThief>

strataData

eval(parse(text=paste0('survThief:::', 'strataData')))
function (wlk) 
{
    clrs <- names(wlk)
    for (i in clrs) {
        wlk[[i]]$d$stratum <- i
    }
    survData <- data.frame(time = numeric(0), status = numeric(0), 
        stratum = numeric(0))
    for (i in clrs) {
        survData <- rbind(survData, wlk[[i]]$d)
    }
    survData <- survData[order(survData$time), ]
    wlk$survData <- survData
    return(wlk)
}
<bytecode: 0x7fc9b41b8c48>
<environment: namespace:survThief>

survThief

eval(parse(text=paste0('survThief:::', 'survThief')))
function (imgPath, atRisk, method = "exact", tol = 0, clickData = NULL, 
    BW = F, quantization = F, stairs = T) 
{
    library(imager)
    if (quantization) 
        pTemp <- quantize(imgPath)
    else pTemp <- load.image(imgPath)
    if (dim(pTemp)[4] == 4) 
        assign("p", as.cimg(pTemp[, , , -4]), envir = .GlobalEnv)
    else assign("p", pTemp, envir = .GlobalEnv)
    assign("width", dim(p)[1], envir = .GlobalEnv)
    assign("height", dim(p)[2], envir = .GlobalEnv)
    assign("tol", tol, envir = .GlobalEnv)
    assign("BW", BW, envir = .GlobalEnv)
    assign("stairs", stairs, envir = .GlobalEnv)
    if (length(clickData) > 0) 
        clicks <- clickData
    else clicks <- digi(p)
    startI <- round(clicks$data$x[1] * width)
    startJ <- height - round(clicks$data$y[1] * height)
    clicks$cal$x <- round(clicks$cal$x * width)
    clicks$cal$y <- round(clicks$cal$y * height)
    wlk <- list()
    j <- 1
    for (clr in colnames(atRisk)[-1]) {
        j <- j + 1
        if (!is.na(clicks$data$x[j])) {
            if (BW) 
                RGB <- c(0, 0, 0)
            else RGB <- whatColourIsThis(round(clicks$data$x[j] * 
                width), height - round(clicks$data$y[j] * height))
            wlk[[clr]] <- walk(startI, startJ, RGB)
            if (method == "reconstruct") {
                wlk[[clr]]$d <- rescale(wlk[[clr]]$d, c(clicks$x1, 
                  clicks$cal$x[1]), c(clicks$x2, clicks$cal$x[2]), 
                  c(clicks$y1, clicks$cal$y[3]), c(clicks$y2, 
                    clicks$cal$y[4]))
                wlk[[clr]]$d <- guyot(wlk[[clr]]$d, atRisk, clr)
                wlk[[clr]]$d <- wlk[[clr]]$d[!names(wlk[[clr]]$d) %in% 
                  "arm"]
            }
            else {
                spectr <- spectrumBW(wlk[[clr]]$line, p, RGB)
                wlk[[clr]]$d <- detectCensoring(spectr, wlk[[clr]]$d)
                wlk[[clr]]$d <- rescale(wlk[[clr]]$d, c(clicks$x1, 
                  clicks$cal$x[1]), c(clicks$x2, clicks$cal$x[2]), 
                  c(clicks$y1, clicks$cal$y[3]), c(clicks$y2, 
                    clicks$cal$y[4]))
                wlk[[clr]]$atRisk <- censoredAtRisk(wlk[[clr]]$d, 
                  atRisk, clr)
                wlk[[clr]]$d <- adjCensored(wlk[[clr]]$d, wlk[[clr]]$atRisk)
                wlk[[clr]]$d <- getDeaths(wlk[[clr]]$d, atRisk, 
                  clr)
            }
        }
    }
    wlk <- strataData(wlk)
    return(wlk)
}

walk

eval(parse(text=paste0('survThief:::', 'walk')))
function (i, j, RGB, surv0) 
{
    line <- array(1, dim = c(width, height, 1, 3))
    d <- matrix(nrow = width, ncol = 2)
    colnames(d) <- c("time", "survProb")
    line[i, j, 1, ] <- RGB
    p <- as.array(p)
    while (i < width) {
        i = i + 1
        if (similarColour(p[i, j, 1, ], RGB, tol)) {
            line[i, j, 1, ] <- RGB
            d[i, ] <- c(i, height - j)
        }
        else {
            i = i - 1
            j = j + 1
            if (similarColour(p[i, j, 1, ], RGB, tol)) {
                line[i, j, 1, ] <- RGB
            }
            else {
                newCoors <- jump(i, j - 1, RGB, p)
                if (is.null(newCoors[1])) 
                  break
                if (stairs) 
                  myline <- fillGapStairs(line, i, j - 1, newCoors[1], 
                    newCoors[2], RGB)
                else myline <- fillGapStraight(line, i, j - 1, 
                  newCoors[1], newCoors[2], RGB)
                line <- myline
                i <- newCoors[1]
                j <- newCoors[2]
            }
        }
    }
    d <- as.data.frame(d)
    d <- d[complete.cases(d), ]
    d$time <- d$time - d$time[1]
    d$survProb <- d$survProb - d$survProb[1]
    d <- d[!duplicated(d$survProb), ]
    line <- as.cimg(line)
    return(list(line = line, d = d))
}

whatColourIsThis

eval(parse(text=paste0('survThief:::', 'whatColourIsThis')))
function (i, j) 
{
    return(p[i, j, 1, ])
}
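walk() and jump() decide whether a pixel belongs to the traced curve via the package-internal similarColour() helper, which is not dumped above. A minimal sketch of what such a tolerance check could look like (this is an assumed reconstruction, not the package's actual code; imager stores channel values in [0, 1]):

```r
# Hypothetical sketch of the colour-matching test assumed by walk():
# two RGB triples match when every channel differs by at most `tol`.
similarColour <- function(rgb1, rgb2, tol = 0) {
  all(abs(rgb1 - rgb2) <= tol)
}

similarColour(c(1, 0, 0), c(0.95, 0.02, 0), tol = 0.05)  # TRUE
similarColour(c(1, 0, 0), c(0, 0, 1),       tol = 0.05)  # FALSE
```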

3 Method Evaluation


Simulating survival times from a Weibull distribution with survival function

\[S(t)=\exp \left(-\lambda t^{\gamma}\right),\]

where \(\gamma\) is the shape and \(\lambda\) the rate parameter. Under this parametrisation the mean is

\(\begin{equation} E(\mathrm{X})=\lambda^{-1 / \gamma} \, \Gamma(1+1 / \gamma) \end{equation}\)

assumes properly behaved data

(Wan, Peng, and Li 2015b)
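Survival times under \(S(t)=\exp(-\lambda t^{\gamma})\) can be drawn with base R's rweibull(), whose scale argument \(b\) satisfies \(S(t)=\exp(-(t/b)^{\gamma})\), so \(b=\lambda^{-1/\gamma}\). A minimal simulation sketch with uniform censoring (the parameter values and censoring horizon are chosen purely for illustration):

```r
set.seed(1)

# Draw event times from S(t) = exp(-lambda * t^gamma).  rweibull() is
# parametrised as S(t) = exp(-(t/scale)^shape), hence scale = lambda^(-1/gamma).
simSurv <- function(n, gamma, lambda, censHorizon) {
  t    <- rweibull(n, shape = gamma, scale = lambda^(-1 / gamma))
  cens <- runif(n, 0, censHorizon)            # uniform censoring times
  data.frame(time   = pmin(t, cens),
             status = as.integer(t <= cens))  # 1 = event, 0 = censored
}

sim <- simSurv(1e5, gamma = 1.5, lambda = 0.2, censHorizon = 20)

# Sanity check: uncensored draws have mean lambda^(-1/gamma) * Gamma(1 + 1/gamma)
mean(rweibull(1e5, shape = 1.5, scale = 0.2^(-1 / 1.5)))
0.2^(-1 / 1.5) * gamma(1 + 1 / 1.5)
```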

Blur images?

4 Meta Survival Analysis


(Chakos et al. 2018)

4.1 meta showcase


Reconstructed images:

Not reconstructed yet

4.2 reconstructed data


load("../../reproducing survival meta analysis/meta.RData")

colnames(d)[1] <- "time in months"

datatable(cbind(format(round(d[1], 4)), d[-1]), 
          editable = TRUE,rownames = F,
          options = list(
                    pageLength = nrow(d), 
                    scrollY = "400px", dom = 't'
                    )
          )

potential variables: median age at study start.

4.3 modelling


4.3.1 Kaplan Meier


\[\widehat{S}(t)=\prod_{i : t_{i} \leq t}\left(1-\frac{d_{i}}{n_{i}}\right)\]
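The product-limit estimator is easy to compute by hand in base R; a small sketch on invented toy data (six subjects, status 1 = death, 0 = censored):

```r
# Toy data, invented for illustration
time   <- c(1, 2, 2, 3, 4, 5)
status <- c(1, 1, 0, 1, 0, 1)

# Product-limit estimator: at each distinct event time t_i multiply the
# running estimate by (1 - d_i / n_i)
kmHand <- function(time, status) {
  ts <- sort(unique(time[status == 1]))
  n  <- sapply(ts, function(t) sum(time >= t))                # n_i: at risk at t_i
  d  <- sapply(ts, function(t) sum(time == t & status == 1))  # d_i: deaths at t_i
  data.frame(time = ts, surv = cumprod(1 - d / n))
}

kmHand(time, status)  # surv: 5/6, 2/3, 4/9, 0 at times 1, 2, 3, 5
```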

## KM

colnames(d)[1] <- "time"

library(survival)
fit <- survfit(Surv(time, status) ~ stratum, data=d)

library(survminer)
## pdf("results/detectCensoring V2.pdf",width=14,height=10)
KMPlot <- ggsurvplot(fit, risk.table = T, tables.theme = theme_cleantable(),
           risk.table.col = "strata", pval = TRUE, 
           ggtheme = theme_bw(), risk.table.y.text=F,
           break.time.by=12)

KMPlot

4.3.2 Bayesian Weibull regression


\[\begin{equation} p\left(t_{i} | \boldsymbol{x}_{i}, v_{i}, \boldsymbol{\beta}, \alpha\right)=\alpha^{v_{i}} t_{i}^{v_{i}(\alpha-1)} \exp \left(v_{i} \boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{x}_{i}-t_{i}^{\alpha} \exp \left(\boldsymbol{\beta}^{\mathrm{T}} \boldsymbol{x}_{i}\right)\right) \end{equation}\]

intercept and shape:

\[ \begin{equation} \begin{aligned} \beta_{0} & \sim \mathrm{N}\left(0,10^{2}\right) \\ \log \alpha & \sim \mathrm{N}\left(0,10^{2}\right) \end{aligned} \end{equation} \]

covariates:

\[\begin{equation} \begin{aligned} \beta_{j} & \sim \mathrm{N}\left(0, \sigma_{s}^{2} \sigma_{j}^{2}\right), \text { for } j=1, \ldots, m_{b g} \\ \sigma_{j}^{2} & \sim \operatorname{Inv}-\chi^{2}(1), \text { for } j=1, \ldots, m_{b g} \\ \sigma_{s} & \sim \operatorname{Half}-\mathrm{N}\left(0,10^{2}\right) \end{aligned} \end{equation}\]
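These priors are straightforward to simulate, which helps to get a feeling for their scale: an Inv-χ²(1) draw is the reciprocal of a χ²(1) draw, and the half-normal is the absolute value of a normal draw. A base-R sketch of one joint prior draw (m_bg = 3 is chosen only for illustration):

```r
set.seed(42)
m_bg <- 3  # number of background covariates, illustrative

sigma_s  <- abs(rnorm(1, 0, 10))    # sigma_s   ~ Half-N(0, 10^2)
sigma2_j <- 1 / rchisq(m_bg, 1)     # sigma_j^2 ~ Inv-chi^2(1)
beta_j   <- rnorm(m_bg, 0, sigma_s * sqrt(sigma2_j))  # beta_j ~ N(0, sigma_s^2 sigma_j^2)
```

This mirrors the non-centred construction in the Stan code further below, where beta_bg is built as r_global * sqrt_vec(r_local) .* beta_bg_raw.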

original Weibull:

\[x \sim W \operatorname{eib}(\alpha, \lambda) = \lambda \alpha x^{\alpha-1} e^{-\lambda x^{\alpha}}\]

KM overlay and credible intervals

stanWeibFitSurvMean <- readRDS("weibFit.rds")

# library(RColorBrewer)
# myColors <- brewer.pal(2,"Set1")
# names(myColors) <- levels(stanWeibFitSurvMean$treatment)
# colScale <- scale_colour_manual(name = "treatment",values = myColors)

KMPlot$plot + geom_line(data = stanWeibFitSurvMean,
                   mapping = aes(x = t,y = survival_mean, group = treatment, color = treatment)) + 
  geom_line(data = stanWeibFitSurvMean,
                   mapping = aes(x = t,y = survival_95upper, group = treatment, color = treatment), linetype = "dotted") + 
  geom_line(data = stanWeibFitSurvMean,
                   mapping = aes(x = t,y = survival_95lower, group = treatment, color = treatment), linetype = "dotted")

Posterior draws

cred bands (evtl)

Model Checking - Rhat

load("../../reproducing survival meta analysis/weibFit.RData")

s <- capture.output(stanWeibFit)
library(stringr)
## Warning: package 'stringr' was built under R version 3.5.2
sVars <- word(s[6:length(s)], 1)

print(stanWeibFit, pars = sVars[1:15])
## Inference for Stan model: weibull.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##                  mean se_mean      sd  2.5%   25%   50%    75%   97.5%
## tau_s_bg_raw     4.32    0.29    4.08  0.36  1.50  2.95   5.72   17.38
## tau_bg_raw[1]    5.30    2.17   70.17  0.16  0.40  0.80   1.88   20.05
## alpha_raw       -0.10    0.00    0.00 -0.11 -0.11 -0.10  -0.10   -0.10
## beta_bg_raw[1]   0.51    0.02    0.42  0.09  0.19  0.37   0.70    1.65
## mu              -2.45    0.00    0.11 -2.66 -2.52 -2.45  -2.38   -2.24
## beta_bg[1]       1.05    0.00    0.11  0.83  0.98  1.05   1.13    1.27
## alpha            0.36    0.00    0.01  0.33  0.35  0.36   0.37    0.39
## yhat_uncens[1] 268.94   21.61 1382.32  0.00  1.43 18.68 131.58 2172.87
## yhat_uncens[2] 262.84   18.56 1183.79  0.00  1.53 19.18 130.67 2026.01
## yhat_uncens[3] 245.00   13.71  868.06  0.00  1.68 16.83 127.94 2134.31
## yhat_uncens[4] 245.88   18.08 1037.34  0.00  1.33 17.06 120.37 1987.31
## yhat_uncens[5] 253.74   14.83  935.61  0.00  1.29 18.65 140.76 1913.95
## yhat_uncens[6] 259.45   13.70  875.99  0.00  1.61 17.77 141.35 2276.05
## yhat_uncens[7] 243.82   13.47  853.56  0.00  1.46 17.75 121.18 2100.92
## yhat_uncens[8] 264.39   26.18 1670.85  0.00  1.50 18.64 125.07 1957.48
##                n_eff Rhat
## tau_s_bg_raw     201 1.01
## tau_bg_raw[1]   1043 1.00
## alpha_raw       2132 1.00
## beta_bg_raw[1]   633 1.01
## mu              1531 1.00
## beta_bg[1]      2036 1.00
## alpha           2158 1.00
## yhat_uncens[1]  4093 1.00
## yhat_uncens[2]  4070 1.00
## yhat_uncens[3]  4007 1.00
## yhat_uncens[4]  3291 1.00
## yhat_uncens[5]  3978 1.00
## yhat_uncens[6]  4091 1.00
## yhat_uncens[7]  4013 1.00
## yhat_uncens[8]  4072 1.00
## 
## Samples were drawn using NUTS(diag_e) at Mon Feb 25 12:26:36 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Model Checking - chains

library(rstan)
## Warning: package 'StanHeaders' was built under R version 3.5.2
traceplot(stanWeibFit, par = c("alpha","mu","beta_bg"))

Model Checking - autocorrelation

library(bayesplot)
mcmc_acf(as.matrix(stanWeibFit), pars = c("alpha","mu","beta_bg[1]"))

mcmc_areas(as.matrix(stanWeibFit), pars = c("alpha","mu","beta_bg[1]"), prob = 0.95)

Model Checking - parameter significance

mcmc_areas(as.matrix(stanWeibFit), pars = c("alpha","mu","beta_bg[1]"), prob = 0.95)

Stan Code


functions {
  vector sqrt_vec(vector x) {
    vector[dims(x)[1]] res;

    for (m in 1:dims(x)[1]){
      res[m] = sqrt(x[m]);
    }

    return res;
  }

  vector bg_prior_lp(real r_global, vector r_local) {
    r_global ~ normal(0.0, 10.0);
    r_local ~ inv_chi_square(1.0);

    return r_global * sqrt_vec(r_local);
  }
}

data {
  int<lower=0> Nobs;
  int<lower=0> Ncen;
  int<lower=0> M_bg;
  vector[Nobs] yobs;
  vector[Ncen] ycen;
  matrix[Nobs, M_bg] Xobs_bg;
  matrix[Ncen, M_bg] Xcen_bg;
}

/* transformed data { */
/*   real<lower=0> tau_mu; */
/*   real<lower=0> tau_al; */
/*   tau_mu = 10.0; */
/*   tau_al = 10.0; */
/* } */

parameters {
  real<lower=0> tau_s_bg_raw;
  vector<lower=0>[M_bg] tau_bg_raw;
  real alpha_raw;
  vector[M_bg] beta_bg_raw;
  real mu;
}

transformed parameters {
  vector[M_bg] beta_bg;
  real alpha;
  beta_bg = bg_prior_lp(tau_s_bg_raw, tau_bg_raw) .* beta_bg_raw;
  alpha = exp(10 * alpha_raw);
}

model {
  yobs ~ weibull(alpha, exp(-(mu + Xobs_bg * beta_bg)/alpha));
  target += weibull_lccdf(ycen | alpha, exp(-(mu + Xcen_bg * beta_bg)/alpha));

  beta_bg_raw ~ normal(0.0, 1.0);
  alpha_raw ~ normal(0.0, 1.0);

  /* mu ~ normal(0.0, tau_mu); */
  mu ~ normal(0, 10);
}

generated quantities {
  real yhat_uncens[Nobs + Ncen];
  real log_lik[Nobs + Ncen];
  real lp[Nobs + Ncen];

  for (i in 1:Nobs) {
    lp[i] = mu + Xobs_bg[i,] * beta_bg;
    yhat_uncens[i] = weibull_rng(alpha, exp(-(mu + Xobs_bg[i,]
                                              * beta_bg)/alpha));
    log_lik[i] = weibull_lpdf(yobs[i] | alpha, exp(-(mu + Xobs_bg[i,]
                                                     * beta_bg)/alpha));
  }
  for (i in 1:Ncen) {
    lp[Nobs + i] = mu + Xcen_bg[i,] * beta_bg;
    yhat_uncens[Nobs + i] = weibull_rng(alpha, exp(-(mu + Xcen_bg[i,]
                                                     * beta_bg)/alpha));
    log_lik[Nobs + i] = weibull_lccdf(ycen[i] | alpha, exp(-(mu + Xcen_bg[i,]
                                                             * beta_bg)/alpha));
  }
}
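The overlaid curves earlier in this section follow from this model's parametrisation: Stan's weibull(alpha, sigma) has S(t) = exp(-(t/sigma)^alpha), and with sigma = exp(-(mu + x'beta)/alpha) this collapses to S(t) = exp(-t^alpha exp(mu + x'beta)). A sketch of the mapping from posterior draws to survival curves (the numeric values are hypothetical, loosely based on the posterior means shown above):

```r
# Survival function implied by y ~ weibull(alpha, exp(-(mu + x'beta)/alpha)):
# S(t) = exp(-(t/sigma)^alpha) with sigma = exp(-lp/alpha)
#      = exp(-t^alpha * exp(lp))
survWeib <- function(t, alpha, lp) exp(-t^alpha * exp(lp))

# Hypothetical parameter values, roughly the posterior means reported above
alpha <- 0.36; mu <- -2.45; beta1 <- 1.05
t  <- seq(0, 60, by = 0.5)
s0 <- survWeib(t, alpha, mu)          # reference stratum  (x = 0)
s1 <- survWeib(t, alpha, mu + beta1)  # comparison stratum (x = 1)
```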

4.3.3 Bayesian Dependent Dirichlet Process (DDP) Weibull Model


In nonparametric modelling, the number of parameters is not fixed and often grows with the sample size. In Bayesian nonparametrics, the number of parameters is itself treated as a random variable: a prior is defined over an infinite-dimensional model space, and inference selects the effective number of parameters.

4.3.3.1 Dirichlet Distribution


\[\pi=\left(\pi_{1}, \cdots, \pi_{K}\right) \sim \operatorname{Dirichlet}\left(\alpha_{1}, \cdots, \alpha_{K}\right)=\frac{\prod_{k=1}^{K} \Gamma\left(\alpha_{k}\right)}{\Gamma\left(\sum_{k=1}^{K} \alpha_{k}\right)} \prod_{k=1}^{K} \pi_{k}^{\alpha_{k}-1}\]

with the constraints

  • \(\alpha_{k} \geq 0 \forall k \text { and } \sum_{k} \alpha_{k}>0\)
  • \(\pi_{k} \geq 0 \forall k \text { and } \sum_{k=1}^{K} \pi_{k}=1\)
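Base R has no Dirichlet sampler, but one follows directly from normalised Gamma draws: if \(g_k \sim Ga(\alpha_k, 1)\) independently, then \((g_1, \ldots, g_K)/\sum_k g_k \sim \operatorname{Dirichlet}(\alpha_1, \ldots, \alpha_K)\). A minimal sketch:

```r
set.seed(7)

# Dirichlet via normalised Gamma draws
rdirichlet <- function(n, alpha) {
  K <- length(alpha)
  g <- matrix(rgamma(n * K, shape = alpha), ncol = K, byrow = TRUE)
  g / rowSums(g)   # each row is one Dirichlet draw
}

p <- rdirichlet(5, c(2, 3, 5))
rowSums(p)   # all equal to 1 (up to floating point)
```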

Example (P. Xing, Malings, and Gao 2014)

4.3.3.2 Infinite Mixture Model (P. Xing, Malings, and Gao 2014)


Example with an infinite mixture of normals.

\[p\left(x_{n} | \pi,\left\{\mu_{k}\right\},\left\{\Sigma_{k}\right\}\right)=\sum_{k=1}^{\infty} \pi_{k} \mathcal{N}\left(x_{n} | \mu_{k}, \Sigma_{k}\right),\]

with \(\pi^{(K)} \sim \operatorname{Dirichlet}\left(\frac{\alpha}{K}, \cdots, \frac{\alpha}{K}\right)\), where \(\alpha\) is a scaling parameter divided into equal parts across the \(K\) components of the Dirichlet distribution.

As \(K\) grows large, the number of components with appreciable weight becomes independent of the number of initial components, which motivates the stick-breaking construction.

4.3.3.3 Dirichlet Process as stick-breaking problem (Encyclopedia of Machine Learning 2010, 280)


With \(H\) a base distribution and \(\alpha\) a scaling parameter, \(G\) follows a Dirichlet process, i.e. \(G \sim \mathrm{DP}(\alpha, H)\), if

\[\begin{array}{ll}{\beta_{k} \sim \operatorname{Beta}(1, \alpha)} & {\theta_{k}^{*} \sim H} \\ {\pi_{k}=\beta_{k} \prod_{l=1}^{k-1}\left(1-\beta_{l}\right)} & {G=\sum_{k=1}^{\infty} \pi_{k} \delta_{\theta_{k}^{*}}}\end{array}\]

stick breaking
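Truncating the construction at K sticks makes it easy to simulate; a base-R sketch (K and α chosen for illustration):

```r
set.seed(3)

# Truncated stick-breaking: beta_k ~ Beta(1, alpha),
# pi_k = beta_k * prod_{l < k} (1 - beta_l)
stickBreak <- function(K, alpha) {
  b <- rbeta(K, 1, alpha)
  b * cumprod(c(1, 1 - b[-K]))   # remaining stick length before each break
}

w <- stickBreak(50, alpha = 2)
sum(w)   # just under 1: the rest of the stick stays unbroken
```

Smaller α concentrates the mass on the first few components; larger α spreads it out, matching its role as the scaling parameter.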

4.3.3.4 Bayesian Dependent Dirichlet Process (DDP) Weibull Formulation (Shi et al. 2018)


\[ \begin{aligned} y_{i}\left|\alpha_{i}, \lambda_{i}, \boldsymbol{\beta}_{i}, \mathbf{Z}_{\mathbf{i}}\right.& \sim W e i b\left(y_{i} | \alpha_{i}, \lambda_{i} \exp \left(\mathbf{Z}_{\mathbf{i}}^{\mathrm{T}} \boldsymbol{\beta}_{\boldsymbol{i}}\right)\right), \quad i=1, \ldots, n \\\left(\alpha_{i}, \lambda_{i}, \boldsymbol{\beta}_{\boldsymbol{i}}\right)|G& \sim G, \quad i=1, \ldots, n \\ G & \sim D P\left(G_{0}, \nu\right) \\ G_{0} &=G a(\lambda | \alpha_{0}, \lambda_{0}) I_{(f(\lambda), u)}(\alpha) G a\left(\alpha_{\alpha}, \lambda_{\alpha}\right) q(\boldsymbol{\beta}) \\ \lambda_{0} & \sim G a\left(\alpha_{00}, \lambda_{00}\right) \\ \nu & \sim G a(a, b) \end{aligned} \] where,

  • \(\nu\) is the concentration parameter (likely the scaling parameter \(\alpha\) in the notation above)

  • \(x \sim G a(\alpha, \lambda)\) means \(\frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}\)

  • \(x \sim W \operatorname{eib}(\alpha, \lambda)\) means \(\lambda \alpha x^{\alpha-1} e^{-\lambda x^{\alpha}}\)
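Both densities can be checked against base R: dgamma(x, shape, rate) already matches \(Ga(\alpha, \lambda)\), while dweibull(x, shape, scale) uses \(S(x)=\exp(-(x/\mathrm{scale})^{\mathrm{shape}})\), so the \(Weib(\alpha, \lambda)\) above corresponds to scale \(= \lambda^{-1/\alpha}\):

```r
# Ga(alpha, lambda) density from the definition above vs base R's rate form
dGa <- function(x, a, l) l^a / gamma(a) * x^(a - 1) * exp(-l * x)
all.equal(dGa(2.5, a = 3, l = 0.7), dgamma(2.5, shape = 3, rate = 0.7))  # TRUE

# Weib(alpha, lambda) density from the definition above; base R needs
# scale = lambda^(-1/alpha)
dWeib <- function(x, a, l) l * a * x^(a - 1) * exp(-l * x^a)
all.equal(dWeib(2.5, a = 1.5, l = 0.4),
          dweibull(2.5, shape = 1.5, scale = 0.4^(-1 / 1.5)))  # TRUE
```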

The Dirichlet distribution is the conjugate prior for the mixture weights of a multinomial distribution.

4.3.3.5 Modelling with package DPWeibull


library(DPWeibull)
library(fastDummies)

y <- as.matrix(d[1:2])
x <- as.matrix(cbind(1*(d$stratum == "open"), 
                     dummy_cols(d$study, remove_first_dummy = T)[-1]))

weibModelCovs <- dpweib(y ~ x)

4.3.3.6 My Objectives

Draw the posterior survival curve

(Shi et al. 2018)

model it in Stan

4.3.4 Cox model with interaction terms to allow curves to cross?

(Wang et al. 2017)


5 Future Applications


  • survThief as a validation method for machine-learning line recognition in plots
  • generate a general survival database

6 Future Developments


tasks <- c("handle pdfs",
  "handle black and white images",
  "better quantization",
  "additional parameters",
  "Bayesian mixture model",
  "digitizing algorithm when censoring marks but no at risk table",
  "no clicking needed",
  "reconstruct data hazard rate curves")
priority <- 0
priorities <- cbind(tasks, priority)

datatable(priorities, editable = TRUE,
          options = list(
                    pageLength = nrow(priorities), 
                    scrollY = "400px", dom = 't'
                    )
          )

7 Session Info


devtools::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  os       macOS  10.14                
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2019-03-23                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version    date       lib source                         
##  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.5.1)                 
##  backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.0)                 
##  bayesplot   * 1.6.0      2018-08-02 [1] CRAN (R 3.5.0)                 
##  broom         0.5.1      2018-12-05 [1] CRAN (R 3.5.0)                 
##  callr         3.1.1      2018-12-21 [1] CRAN (R 3.5.0)                 
##  cli           1.1.0      2019-03-19 [1] CRAN (R 3.5.1)                 
##  cmprsk        2.2-7      2014-06-17 [1] CRAN (R 3.5.0)                 
##  colorspace    1.4-1      2019-03-18 [1] CRAN (R 3.5.2)                 
##  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                 
##  crosstalk     1.0.0      2016-12-21 [1] CRAN (R 3.5.0)                 
##  data.table    1.11.8     2018-09-30 [1] CRAN (R 3.5.0)                 
##  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)                 
##  devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)                 
##  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)                 
##  dplyr       * 0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                 
##  DT          * 0.5        2018-11-05 [1] CRAN (R 3.5.0)                 
##  epuRate     * 0.1        2019-03-01 [1] Github (holtzy/epuRate@54b7094)
##  evaluate      0.12       2018-10-09 [1] CRAN (R 3.5.0)                 
##  fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.0)                 
##  generics      0.0.2      2018-11-29 [1] CRAN (R 3.5.0)                 
##  ggplot2     * 3.1.0      2018-10-25 [1] CRAN (R 3.5.0)                 
##  ggpubr      * 0.2        2018-11-15 [1] CRAN (R 3.5.0)                 
##  ggridges      0.5.1      2018-09-27 [1] CRAN (R 3.5.0)                 
##  glue          1.3.1      2019-03-12 [1] CRAN (R 3.5.2)                 
##  gridExtra     2.3        2017-09-09 [1] CRAN (R 3.5.0)                 
##  gtable        0.2.0      2016-02-26 [1] CRAN (R 3.5.0)                 
##  highr         0.7        2018-06-09 [1] CRAN (R 3.5.0)                 
##  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                 
##  htmlwidgets   1.3        2018-09-30 [1] CRAN (R 3.5.0)                 
##  httpuv        1.4.5.1    2018-12-18 [1] CRAN (R 3.5.0)                 
##  httr          1.4.0      2018-12-11 [1] CRAN (R 3.5.1)                 
##  inline        0.3.15     2018-05-18 [1] CRAN (R 3.5.0)                 
##  jsonlite      1.6        2018-12-07 [1] CRAN (R 3.5.0)                 
##  kernlab       0.9-27     2018-08-10 [1] CRAN (R 3.5.0)                 
##  km.ci         0.5-2      2009-08-30 [1] CRAN (R 3.5.0)                 
##  KMsurv        0.1-5      2012-12-03 [1] CRAN (R 3.5.0)                 
##  knitr       * 1.21       2018-12-10 [1] CRAN (R 3.5.1)                 
##  labeling      0.3        2014-08-23 [1] CRAN (R 3.5.0)                 
##  later         0.8.0      2019-02-11 [1] CRAN (R 3.5.2)                 
##  lattice       0.20-38    2018-11-04 [1] CRAN (R 3.5.0)                 
##  lazyeval      0.2.2      2019-03-15 [1] CRAN (R 3.5.2)                 
##  loo           2.0.0      2018-04-11 [1] CRAN (R 3.5.0)                 
##  magick        2.0        2018-10-05 [1] CRAN (R 3.5.0)                 
##  magrittr    * 1.5        2014-11-22 [1] CRAN (R 3.5.0)                 
##  Matrix        1.2-15     2018-11-01 [1] CRAN (R 3.5.0)                 
##  matrixStats   0.54.0     2018-07-23 [1] CRAN (R 3.5.0)                 
##  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                 
##  mime          0.6        2018-10-05 [1] CRAN (R 3.5.0)                 
##  munsell       0.5.0      2018-06-12 [1] CRAN (R 3.5.0)                 
##  mvbutils    * 2.8.232    2018-12-12 [1] CRAN (R 3.5.0)                 
##  nlme          3.1-137    2018-04-07 [1] CRAN (R 3.5.1)                 
##  pillar        1.3.1      2018-12-15 [1] CRAN (R 3.5.0)                 
##  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)                 
##  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)                 
##  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)                 
##  plotly      * 4.8.0      2018-07-20 [1] CRAN (R 3.5.1)                 
##  plotwidgets * 0.4        2016-09-06 [1] CRAN (R 3.5.0)                 
##  plyr          1.8.4      2016-06-08 [1] CRAN (R 3.5.0)                 
##  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                 
##  processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.0)                 
##  promises      1.0.1      2018-04-13 [1] CRAN (R 3.5.0)                 
##  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.0)                 
##  purrr         0.3.2      2019-03-15 [1] CRAN (R 3.5.2)                 
##  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                 
##  Rcpp          1.0.1      2019-03-17 [1] CRAN (R 3.5.2)                 
##  remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.0)                 
##  reshape2      1.4.3      2017-12-11 [1] CRAN (R 3.5.0)                 
##  rlang         0.3.2      2019-03-21 [1] CRAN (R 3.5.1)                 
##  rmarkdown   * 1.11       2018-12-08 [1] CRAN (R 3.5.0)                 
##  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                 
##  rstan       * 2.18.2     2018-11-07 [1] CRAN (R 3.5.0)                 
##  scales        1.0.0      2018-08-09 [1] CRAN (R 3.5.0)                 
##  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.0)                 
##  shiny       * 1.2.0      2018-11-02 [1] CRAN (R 3.5.0)                 
##  StanHeaders * 2.18.1     2019-01-28 [1] CRAN (R 3.5.2)                 
##  stringi       1.4.3      2019-03-12 [1] CRAN (R 3.5.2)                 
##  stringr     * 1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                 
##  survival    * 2.43-3     2018-11-26 [1] CRAN (R 3.5.0)                 
##  survminer   * 0.4.3      2018-08-04 [1] CRAN (R 3.5.0)                 
##  survMisc      0.5.5      2018-07-05 [1] CRAN (R 3.5.0)                 
##  survThief   * 0.0.0.9000 2019-03-06 [1] local                          
##  tibble        2.1.1      2019-03-16 [1] CRAN (R 3.5.2)                 
##  tidyr         0.8.3      2019-03-01 [1] CRAN (R 3.5.2)                 
##  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)                 
##  usethis       1.4.0      2018-08-14 [1] CRAN (R 3.5.0)                 
##  viridisLite   0.3.0      2018-02-01 [1] CRAN (R 3.5.0)                 
##  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                 
##  xfun          0.4        2018-10-23 [1] CRAN (R 3.5.0)                 
##  xtable        1.8-3      2018-08-29 [1] CRAN (R 3.5.0)                 
##  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                 
##  zoo           1.8-4      2018-09-19 [1] CRAN (R 3.5.0)                 
## 
## [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

8 References


Bello, Ghalib A., Timothy J. W. Dawes, Jinming Duan, Carlo Biffi, Antonio de Marvao, Luke S. G. E. Howard, J. Simon R. Gibbs, et al. 2019. “Deep-Learning Cardiac Motion Analysis for Human Survival Prediction.” Nature Machine Intelligence 1 (2): 95–104. https://doi.org/10.1038/s42256-019-0019-2.

Chakos, Adam, Dean Jbara, Tristan D. Yan, and David H. Tian. 2018. “Long-Term Survival and Related Outcomes for Hybrid Versus Traditional Arch Repair—a Meta-Analysis.” Annals of Cardiothoracic Surgery 7 (3): 319–27.

Encyclopedia of Machine Learning. 2010. Boston, MA: Springer US.

Guyot, Patricia, AE Ades, Mario Ouwens, and Nicky Welton. 2012. “Enhanced Secondary Analysis of Survival Data: Reconstructing the Data from Published Kaplan-Meier Survival Curves.” BMC Medical Research Methodology 12 (1): 9. http://search.proquest.com/docview/950940971/.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer Series in Statistics. New York, NY: Springer New York.

Naudet, Florian, Charlotte Sakarovitch, Perrine Janiaud, Ioana Cristea, Daniele Fanelli, David Moher, and John P A Ioannidis. 2018. “Data Sharing and Reanalysis of Randomized Controlled Trials in Leading Biomedical Journals with a Full Data Sharing Policy: Survey of Studies Published in The BMJ and PLOS Medicine.” BMJ 360. https://doi.org/10.1136/bmj.k400.

P. Xing, Eric, Carl Malings, and Jingkun Gao. 2014. “Bayesian Nonparametrics: Dirichlet Processes.” Lecture notes, Probabilistic Graphical Models 10-708, Carnegie Mellon University.

Shi, Yushu, Michael Martens, Anjishnu Banerjee, and Purushottam Laud. 2018. “Low Information Omnibus (LIO) Priors for Dirichlet Process Mixture Models.” Bayesian Analysis.

Wan, Xiaomin, Liubao Peng, and Yuanjian Li. 2015a. “A Review and Comparison of Methods for Recreating Individual Patient Data from Published Kaplan-Meier Survival Curves for Economic Evaluations: A Simulation Study.” PLoS One 10 (3). http://search.proquest.com/docview/1666311132/.

———. 2015b. “A Review and Comparison of Methods for Recreating Individual Patient Data from Published Kaplan-Meier Survival Curves for Economic Evaluations: A Simulation Study.” Edited by Robert K. Hills. PLOS ONE 10 (3): e0121353. https://doi.org/10.1371/journal.pone.0121353.

Wang, Kesheng, Ying Yu Liu, Shaoqing Gong, Chun Xu, Xin Xie, Liang Wang, and Xingguang Luo. 2017. “Bayesian Cox Proportional Hazards Model in Survival Analysis of HACE1 Gene with Age at Onset of Alzheimer’s Disease.” International Journal of Clinical Biostatistics and Biometrics 3 (1).

Yu-Wei, Chiu (David Chiu). 2015. Machine Learning with R Cookbook. New York, NY: Packt Publishing.